Load results

load("./processed_data_files/what_we_find_VS_ELM_clust20171019.RData")
doman_viral_pairs = F # if false - add human proteins containing domains
motifs = T # based on Vidal's data

Empirical p-value for seeing a domain a given number of times

What is the chance of randomly seeing any domain the observed number of times among all proteins that interact with a specific viral protein?
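To make the question concrete, here is a minimal sketch of one way such an empirical p-value can be estimated by random sampling. It is not necessarily the procedure that produced `res_count`; the toy data and every name in it (`has_domain`, `n_interactors`, `observed_count`, `n_perm`) are assumptions made for illustration.

```r
# Hypothetical toy data: 1000 background human proteins, 50 of which carry the domain.
set.seed(1)
has_domain <- c(rep(TRUE, 50), rep(FALSE, 950))
n_interactors <- 20   # assumed number of human proteins binding the viral protein
observed_count <- 4   # assumed number of those interactors that carry the domain

# Null distribution: domain counts in random interactor sets of the same size,
# drawn without replacement from the background.
n_perm <- 10000
null_counts <- replicate(n_perm, sum(sample(has_domain, n_interactors)))

# Empirical p-value: fraction of random sets with at least the observed count.
# The +1 correction keeps the estimate from being exactly 0.
emp_p <- (sum(null_counts >= observed_count) + 1) / (n_perm + 1)
emp_p
```

A domain seen no more often than in random sets gets an empirical p-value close to 1, which is what the later filtering steps rely on.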

printTable(res_count, doman_viral_pairs = doman_viral_pairs, motifs = motifs)
writeTable(res_count, "./results/domains_empirical_p_value.tsv")
plot(res_count)

Fisher test for co-occurrence of binding a viral protein and containing a domain

printTable(resJustFISHER, doman_viral_pairs = doman_viral_pairs, motifs = motifs)
writeTable(resJustFISHER, "./results/domains_fisher_test.tsv")
plot(resJustFISHER, IDs_interactor_viral + IDs_domain_human ~ p.value,
     xlab = "Fisher's Exact Test pvalue",
     breaks = seq(-0.01, 1.01, 0.01))
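For reference, the test for a single viral protein and human domain pair can be sketched with base R's `fisher.test` on a 2x2 table. The counts below are hypothetical, and the one-sided `alternative = "greater"` is an assumption; the analysis above may use a different table layout or a two-sided test.

```r
# Hypothetical counts for one viral protein / one human domain pair.
n_background  <- 1000  # all human proteins considered
n_with_domain <- 50    # proteins containing the domain
n_interactors <- 20    # proteins binding the viral protein
n_both        <- 4     # interactors that also contain the domain

# Rows: binds the viral protein (yes/no); columns: contains the domain (yes/no).
contingency <- matrix(
  c(n_both,                 n_interactors - n_both,
    n_with_domain - n_both, n_background - n_interactors - n_with_domain + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(binds_viral = c("yes", "no"), has_domain = c("yes", "no"))
)

# One-sided test: is the domain over-represented among the interactors?
ft <- fisher.test(contingency, alternative = "greater")
ft$p.value
```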

Combining the p-value for seeing a domain a given number of times with the p-value for co-occurrence of binding a viral protein and containing a domain

Multiplying p-values

printTable(resPmult, doman_viral_pairs = doman_viral_pairs, motifs = motifs)
writeTable(resPmult, "./results/domains_empirical_p_value_X_fisher_test.tsv") # write the multiplied p-values, not resJustFISHER
plot(resPmult, IDs_interactor_viral + IDs_domain_human ~ p.value,
     xlab = "Fisher's Exact Test pvalue * \nempirical P value for observing domain in N proteins",
     breaks = seq(-0.01, 1.01, 0.01))

Multiplying the inverse of p-values

The idea is that lower p-values indicate a higher chance of detecting a signal. I am not sure this is statistically correct, but it removes domains with p = 1.0: the inverse of an empirical p-value of 1.0 for the frequency is 0, so the Fisher p-value is multiplied by 0.
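The two combinations can be compared on a few made-up values. This sketch assumes that "inverse" means the complement, 1 - p, which is the reading under which an empirical p-value of 1.0 becomes 0; the vectors `p_fisher` and `p_emp` are hypothetical and are not taken from `resPmult` or `resPmultInv`.

```r
# Hypothetical p-values for three viral protein / domain pairs.
p_fisher <- c(0.001, 0.20, 0.03)
p_emp    <- c(0.010, 1.00, 0.50)

# Plain product: smaller values mean stronger combined evidence.
p_mult <- p_fisher * p_emp

# Product of complements (assumed reading of "inverse"): larger values mean
# stronger evidence, and any pair with an empirical p-value of 1.0 scores 0.
score_inv <- (1 - p_fisher) * (1 - p_emp)

data.frame(p_fisher, p_emp, p_mult, score_inv)
```

Neither product is itself a calibrated p-value, so both are better treated as ranking scores than as significance levels.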

printTable(resPmultInv, doman_viral_pairs = doman_viral_pairs, motifs = motifs)
writeTable(resPmultInv, "./results/domains_inv_empirical_p_value_X_inv_fisher_test.tsv") # write the inverse-product results, not resJustFISHER
plot(resPmultInv, IDs_interactor_viral + IDs_domain_human ~ p.value,
     xlab = "Inverse of Fisher's Exact Test pvalue * \ninverse of empirical P value for observing domain in N proteins",
     breaks = seq(-0.01, 1.01, 0.01))

2-step filtering, ranking by Fisher test p-value

printTable(sequential_filter, doman_viral_pairs = doman_viral_pairs, motifs = motifs)
plot(sequential_filter, IDs_interactor_viral + IDs_domain_human ~ p.value,
     xlab = "Fisher's Exact Test pvalue",
     breaks = seq(-0.01, 1.01, 0.01))

PermutResult2D(res = sequential_filter, N = 500, value.cols = c("p.value", "Emp.p.value")) +
    ggtitle("2D-bin plots of 250 top-scoring viral protein - human domain pairs, \n statistic: count of a domain among interacting partners of a viral protein")
## Warning: Removed 242 rows containing non-finite values (stat_density).
## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 242 rows containing missing values
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 278 rows containing non-finite values (stat_bin2d).
## Warning: Removed 242 rows containing non-finite values (stat_density).

writeTable(sequential_filter, "./results/domains_sequential_filter.tsv")

How do all these methods perform at finding ELM domains?

The absolute numbers

The total number of ELM domains found by each method, compared with the total number of domains in ELM.

The enrichment in ELM domains over the background

P-value for the enrichment in ELM domains over the background
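The three quantities above can be sketched for a single method as follows. The domain sets are invented for illustration, and the hypergeometric test via `phyper` is one reasonable choice for the enrichment p-value, not necessarily the one used in this analysis.

```r
# Hypothetical sets of domain identifiers.
all_domains   <- paste0("D", 1:500)             # background: every domain tested
elm_domains   <- paste0("D", 1:40)              # domains listed in ELM (assumed)
found_domains <- paste0("D", c(1:10, 101:130))  # domains passing one method's filter (assumed)

# Absolute numbers: ELM domains recovered, and as a fraction of all ELM domains.
n_found_elm <- length(intersect(found_domains, elm_domains))
frac_of_elm <- n_found_elm / length(elm_domains)

# Enrichment: ELM fraction among found domains relative to the background fraction.
enrichment <- (n_found_elm / length(found_domains)) /
  (length(elm_domains) / length(all_domains))

# P-value: chance of drawing at least n_found_elm ELM domains when picking
# length(found_domains) domains at random from the background.
p_enrich <- phyper(n_found_elm - 1,
                   length(elm_domains),
                   length(all_domains) - length(elm_domains),
                   length(found_domains),
                   lower.tail = FALSE)

c(n_found_elm = n_found_elm, frac_of_elm = frac_of_elm,
  enrichment = enrichment, p_enrich = p_enrich)
```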